1. INTRODUCTION

This paper summarizes the results of a
benchmarking exercise conducted as part of the
NASA-supported Advanced Satellite Aviation-Weather
Products (ASAP) Program. The goal of ASAP is to 
increase and optimize the use of satellite data sets 
within the existing FAA Aviation Weather Research 
Program (AWRP) Product Development Team (PDT) 
structure and to transfer advanced satellite expertise to 
the PDTs. Currently, ASAP fosters collaborative efforts 
between NASA Laboratories, the University of
Wisconsin Cooperative Institute for Meteorological
Satellite Studies (UW-CIMSS), the University of
Alabama in Huntsville (UAH), and the AWRP PDTs.
This collaboration involves the testing and evaluation of 
existing satellite algorithms developed or proposed by 
AWRP teams, the introduction of new techniques and 
data sets to the PDTs from the satellite community, and 
enhanced access to new satellite data sets available 
through CIMSS and NASA Langley Research Center for 
evaluation and testing. 

The In-Flight Icing PDT (IFIPDT) developed the 
Current Icing Potential (CIP) product, which is now run 
operationally at the National Weather Service’s Aviation 
Weather Center. This product combines model output 
with observational data to provide an hourly, three- 
dimensional, gridded depiction of icing potential. While 
CIP incorporates GOES information, it does so in a
relatively rudimentary manner, as a cloud mask. For the
IFIPDT, the CIP is a natural target for enhancement 
using advanced satellite products, such as those being 
developed at NASA’s Langley Research Center (LaRC). 
We anticipate that the accuracy of the CIP should 
improve as it is extended to include cloud top phase,
effective particle size and other attributes of icing 
severity. 

IFIPDT members have already been examining the
NASA advanced satellite products during forecasting
exercises in support of field research projects
conducted during the past two winter seasons.
Anecdotally, the forecasters have found these products 
to be highly useful for flight planning and directing icing 
flight missions by helping to pinpoint the locations of 
icing conditions. It is now time to provide quantitative 
information on the accuracy of the products. 

In this paper we begin a process of benchmarking
some of the advanced satellite products from LaRC
through comparison with CIP and with pilot reports
(PIREPs). In a broader sense, this paper also begins an
examination of possible verification methods and
strategies appropriate for new products that incorporate
high-resolution satellite imagery.

2. NASA LANGLEY SATELLITE ICING AND CLOUD
PRODUCTS

The Langley cloud products (Minnis et al. 2001) are 
derived from half-hourly Geostationary Operational 
Environmental Satellite (GOES) data taken from GOES- 
10 (West) and GOES-12 (East) using the Visible 
Infrared Solar-infrared Split-window Technique (VISST) 
during the daytime (Minnis et al. 1995, 1998). Each 4- 
km GOES pixel is first classified as clear or cloudy using 
a complex cloud identification scheme (Trepte et al. 
1999). Each of the cloudy pixels is analyzed with the 
VISST to determine cloud phase, optical depth, effective 
particle size, effective temperature, effective height, and 
ice or liquid water path. These parameters are used to 
estimate cloud-top and base altitudes and temperatures. 
The analyses utilize the 0.65, 3.9, 10.8, and 12 or 13
µm GOES imager channels.

A prototype diagnostic aircraft icing index termed 
“risk factor” (Minnis et al. 2003, 2004) has already been 
developed through diagnostic comparisons of the 
Langley cloud products with pilot icing reports (Smith et 
al. 2000, 2002, 2003). Critical observations include the
cloud top temperature Tc, cloud optical depth τ, cloud
phase, cloud droplet effective radius re, as well as the
liquid water path, LWP. The prototype criteria are
summarized in Table 1. The logic tree first checks
whether the pixel is clear or not. If clear, it is eliminated
from further consideration. If cloudy, it will still be
eliminated from further processing if Tc > 272 K (too
warm for icing), or if it is classified as an ice cloud with
optical depth τ < 8. This latter criterion is based on an
expectation that the optical depth would be larger than
this if a significant water cloud were to exist beneath the
ice cloud. Any residual water below the ice cloud is not
likely to be able to produce significant icing. The
remaining pixels are then examined further to evaluate
their potential to cause icing.

If the observed cloud is classified as ice and has a
large optical depth (τ > 8), it is recognized that there is a
possibility that a significant water cloud exists within or
below the ice cloud, but this cannot be resolved by the
satellite observations alone. The pixel is therefore
classified as unknown or indeterminate. It will require
further study or combination with the other CIP data to
arrive at a firmer classification. The remaining pixels
are all classified as supercooled water clouds. They are
classified as having no, low, middle, or high icing
potential based on the values of re, Tc, and LWP as
described in Table 1. Any supercooled water cloud
pixels having characteristics that do not satisfy any of
the criteria 2 - 9 in Table 1 are not considered to be
icing threats.
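
The screening portion of the logic tree described above can be sketched as follows. This is only an illustrative sketch, not the LaRC implementation: the `Pixel` fields, the phase labels, and the returned strings are hypothetical names, and treating an optical depth of exactly 8 as "unknown" is an assumption, since the text gives only the cases below and above 8. The severity ranking of the remaining supercooled water pixels is left to the Table 1 criteria and is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Pixel:
    cloudy: bool   # result of the clear/cloudy classification
    phase: str     # "water" or "ice" (hypothetical labels)
    tc_k: float    # cloud-top temperature Tc (K)
    tau: float     # cloud optical depth
    lwp: float     # liquid water path (g m^-2)


def screen_pixel(p: Pixel) -> str:
    """Return 'none', 'unknown', or 'candidate' for one pixel."""
    if not p.cloudy:
        return "none"          # clear pixels are eliminated
    if p.tc_k > 272.0:
        return "none"          # too warm for icing
    if p.phase == "ice":
        # Thin ice cloud: no significant water cloud expected beneath it.
        # Thick ice cloud: water may exist below but cannot be resolved.
        # Assumption: tau == 8 is grouped with the thick (unknown) case.
        return "none" if p.tau < 8.0 else "unknown"
    if p.lwp < 100.0:
        return "none"          # Table 1, value 0: LWP too small
    return "candidate"         # supercooled water; rank with Table 1


print(screen_pixel(Pixel(True, "water", 265.0, 15.0, 250.0)))
```

Keeping the screening separate from the severity ranking mirrors the order in which the text applies the tests: elimination first, then evaluation of the remaining pixels.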


Table 1: Prototype icing classification criteria for
NASA LaRC satellite-based products.

Value  Criterion                                      Icing Intensity
0      clear, or water cloud w/ Tc > 272 K or         none
       LWP < 100 g m^-2, or ice cloud with
       optical depth τ < 8
1      ice cloud with optical depth τ > 8             unknown

Value  re (µm)   LWP (g m^-2)   Tc (K)   Icing Intensity
2      >10       >100           <272     low
3      >10       >200           <272     mid
4      >10       >300           <272     high
5      >8        >400           <272     low
6      >8        >500           <272     mid
7      >10       >300           <253     high
8      >8        >400           <253     high


An example of the icing product is shown in
Figure 1. Areas with icing potential are apparent over
Iowa, western Tennessee, northern North Dakota, 
around North Carolina, and in the Pacific Northwest. 
Extensive regions of high cloud cover create a number 
of areas of “indeterminate” icing. Small areas with 
potential icing are identified around the edges of some 
cirrus clouds due to overlapping conditions and edge 
effects. 

As an initial effort in our benchmarking process we
will examine some of the component fields that are used
in the LaRC icing classification scheme, starting with
the derived cloud phase field. The usefulness of the
cloud phase field will be evaluated through a
systematic and extensive set of comparisons to
PIREPs obtained during 2003 and 2004, and to the
current CIP. The systematic approach making use of all
available PIREPs is designed to get beyond qualitative
tests that look for patterns that seem realistic and
emphasize anecdotal examples of PIREPs in key areas.

Figure 1. Cloud and icing data for 1915 UTC,
15 March 2004: (a) Stitched GOES-10/12 infrared
brightness temperature, (b) Icing categories.

3. THE CURRENT ICING POTENTIAL (CIP) 
PRODUCT 

The first In-Flight Icing PDT product to be
transferred to operational use in the National Weather
Service was the Current Icing Potential (CIP;
McDonough et al. 2000; Bernstein et al. 2004). The
CIP algorithm applies fuzzy logic techniques to combine
up to fifty-six interest fields into one fused product. CIP
combines data from five sources (multispectral GOES
imagery, model output from the RUC model, surface
observations, NEXRAD radar data, and pilot reports)
and is available on the Aviation Digital Data Service
(ADDS) web page at:

http://adds.aviationweather.noaa.gov

Figure 2 shows an example of CIP hourly output, with 
the letters shown on the figure representing the icing 
type, as reported in a PIREP. The size of the letters is 
related to the reported icing severity. While CIP icing 
potential values are usually reported as decimal values 
between 0 and 1, the color coded scale for this chart is 
presented as icing potential × 100.

Figure 2: Example of CIP hourly output. This example
shows the maximum icing potential value in any 20-km
gridded RUC column, for 1800 UTC, 14 Nov. 2003.
R=rime; C=clear; X=mixed and U=unknown ice type.


3.1 Verification Methods

The CIP verification was accomplished by
evaluating the icing potential field versus pilot reports
(PIREPs) of positive and negative icing. Each PIREP
was matched to the closest CIP grid point and flight
level. The four grid points surrounding the observation,
as well as the 1,000-ft¹ flight levels above and below the
PIREP, were examined. Currently, CIP incorporates
information from PIREPs in the hour prior to the forecast
time. To keep the verifying observations independent of
the diagnosis, this analysis therefore used only
PIREPs in a time window of one hour following the
forecast valid time. Statistics were then computed and
analyzed. The verification statistics for CIP were then
compared to those for the Airmen's Meteorological
Advisories (AIRMETs). Although the AIRMETs are generally
different from CIP (e.g., they cover a relatively broad
volume and are intended to depict icing over a six-hour
period), the comparison is included because AIRMETs
are a readily available operational icing forecast
produced by forecasters at the Aviation Weather Center
(AWC).

The verification methods utilized in the evaluation of 
CIP are based on standard verification concepts that 
recognize the underlying framework for forecast 
verification and the associated high dimensionality of 
the verification problem. The methods used are 
described in greater detail in Brown (1996). The 
specific icing forecast verification methodology outlined 
by Brown et al. (1997) treats icing forecasts and 
observations as YES/NO values. Brown et al. (1999) 
outline how this method can be extended to forecasts 
with values on a continuous scale. Specifically, icing
diagnoses produced by CIP can be converted to a set of
YES/NO values by applying a variety of thresholds. For
example, applying a threshold of 0.30 to CIP diagnoses
would lead to a YES value for all grid points with an
icing potential value greater than or equal to 0.30 while
each grid point with a value less than 0.30 would be
assigned a NO value. The verification methods are
based on a standard YES/NO two-by-two contingency
table (Table 2). Each cell in this table contains a count
of the number of times a particular forecast and
observation pair was observed. The counts on this
table are observation-based (i.e., the sum of the counts
is the total number of YES and NO PIREPs over the
given time period) and therefore not all CIP grid points
are represented.

¹ Note that CIP outputs altitudes in ft rather than meters, since
these are the units used by the aviation end-users. We will
retain these units in this paper rather than converting to metric.
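
The thresholding and counting steps described above can be sketched as follows. This is a minimal sketch under stated assumptions: each matched PIREP is assumed to have already been reduced to a single CIP value (how the neighboring grid points and flight levels are combined is not specified here, so that step is left to the caller), and the function name and sample pairs are hypothetical. The cell labels follow Table 2, with the first letter giving the forecast and the second the observation.

```python
def contingency(pairs, threshold):
    """Count forecast/observation pairs for one CIP threshold.

    pairs: iterable of (cip_value, pirep_is_yes) tuples, one per PIREP.
    A grid value >= threshold is a YES diagnosis, as in the 0.30 example.
    """
    counts = {"YY": 0, "YN": 0, "NY": 0, "NN": 0}
    for cip_value, pirep_is_yes in pairs:
        forecast = "Y" if cip_value >= threshold else "N"
        observed = "Y" if pirep_is_yes else "N"
        counts[forecast + observed] += 1
    return counts


# Hypothetical matched pairs, for illustration only.
sample = [(0.80, True), (0.40, False), (0.10, True), (0.05, False), (0.30, True)]
print(contingency(sample, 0.30))
```

Because the loop runs over PIREPs rather than grid points, the four counts always sum to the number of reports, which is the observation-based property noted above.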

Table 2: Contingency table for evaluation of
dichotomous (e.g., Yes/No) forecasts. The elements
in the cells are the counts of forecast-observation
pairs.

              Observations
Forecast      Yes       No        Total
Yes           YY        YN        YY+YN
No            NY        NN        NY+NN
Total         YY+NY     YN+NN     YY+YN+NY+NN


PODy and PODn are the primary verification 
statistics that are included in this evaluation. They are 
estimates of the proportions of Yes and No observations 
that are correctly diagnosed. Together, PODy and 
PODn measure the ability of the diagnoses to 
discriminate between Yes and No icing observations. 
Table 3 gives the definition and description of these 
statistics. 


Table 3: Verification statistics used in evaluation
of CIP.

Statistic   Definition         Description
PODy        YY/(YY+NY)         Probability of detection of "Yes" observations
PODn        NN/(YN+NN)         Probability of detection of "No" observations
TSS         PODy + PODn - 1    Level of discrimination between Yes and No
                               observations
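
The Table 3 definitions translate directly into code. The sketch below assumes the counts are held in a dictionary keyed by the Table 2 cell labels; the function names and the example counts are hypothetical and are not taken from the evaluation.

```python
def pod_y(c):
    """Proportion of YES observations that were diagnosed YES."""
    return c["YY"] / (c["YY"] + c["NY"])


def pod_n(c):
    """Proportion of NO observations that were diagnosed NO."""
    return c["NN"] / (c["YN"] + c["NN"])


def tss(c):
    """True Skill Statistic: PODy + PODn - 1."""
    return pod_y(c) + pod_n(c) - 1.0


# Hypothetical counts, for illustration only.
counts = {"YY": 30, "NY": 10, "YN": 20, "NN": 60}
print(pod_y(counts), pod_n(counts), tss(counts))
```

A diagnosis that says YES everywhere has PODy = 1 and PODn = 0, so its TSS is 0; this is why TSS is read as a measure of discrimination rather than of detection alone.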

The relationship between PODy and 1-PODn for
different thresholds is the basis for the verification
approach known as “Signal Detection Theory” (SDT).
This relationship can be represented for a given
algorithm with the curve joining the (1-PODn, PODy)
points for different thresholds. The resulting curve is
known as the Relative Operating Characteristic (ROC)
curve in SDT. The closer this curve comes to the upper
left corner, the better the diagnosis. The area under the
curve is a measure of overall forecast skill and provides
another measure that can be compared among forecast
products. This measure is not dependent on the
threshold used. A forecast with zero skill would have an
ROC area of 0.5 or less.
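
One simple way to estimate the area under an ROC curve from a handful of threshold points is the trapezoidal rule. The sketch below is an assumption about how such an area could be computed, not a description of the method used in this evaluation; in particular, anchoring the curve at (0, 0) and (1, 1) is a common convention that is assumed here, and the function name is hypothetical.

```python
def roc_area(points):
    """Trapezoidal area under a curve of (1 - PODn, PODy) points.

    Assumption: the curve is anchored at (0, 0) and (1, 1), the limits
    reached by diagnosing NO everywhere and YES everywhere.
    """
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# A point on the diagonal carries no skill; a point in the upper left
# corner is a perfect diagnosis.
print(roc_area([(0.5, 0.5)]), roc_area([(0.0, 1.0)]))
```

With only one or two threshold points the estimate is coarse, which is one reason an evaluation of this kind uses many thresholds.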


3.2 CIP Data Used for Benchmark Verification 

The valid time period for this evaluation was 
1 October 2003 to 31 March 2004. The 1500 UTC and 
2100 UTC valid times were compared; a total of 341 
CIP files were available for verification. All available 
AIRMETs valid for the specific CIP times were 
incorporated. The total number of PIREPs used as 
observations in the evaluation of CIP and the AIRMETs 
is listed in Table 4. 


Table 4: Icing observations included in the CIP
verification.

Observation    Number of pilot reports
NO             8346
YES (MOG)      1767


3.3 Results 

CIP is relatively efficient at detecting icing
conditions, with a probability of detection for
moderate-or-greater (MOG) YES reports
[PODy(MOG)] of 0.74-0.82 (see Figure 3) and a
corresponding probability of detection of NO reports
(PODn) of 0.62-0.68, depending on the CIP
threshold used. AIRMETs detected a similar proportion of
YES reports, with a PODy(MOG) of 0.751, and
a similar proportion of NO reports, with a PODn
of 0.635. For the CIP evaluation, the eleven thresholds
used were 0.00, 0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65,
0.75, 0.85, and 0.95. For CIP, the icing potential field is
thresholded. For each available PIREP, four CIP grid
points are compared with 64 GDCP cloud phase pixels.
Figure 4 is a ROC plot of all of the CIP thresholds for all
of the valid times with the statistics for the AIRMETs
overlaid. It depicts how skillful the CIP icing potential
field is compared with the AIRMETs.


[Plot title: CIP Evaluations, (1 - PODn) v. PODy(MOG)]

Figure 3: Comparison of PODy(MOG) vs. 1-PODn for
CIP over winter 2003 and for times valid for the
benchmarking period (Oct ’03 - Mar ’04).


[Plot title: October 2003 - March 2004, (1 - PODn) v. PODy(MOG); x-axis: 1 - PODn]

Figure 4: Same as Figure 3, except for CIP and
AIRMETs.

4. NASA LaRC GOES-DERIVED CLOUD
PRODUCTS (GDCP) VERIFICATION

The GDCP produces several products. The reason 
that we began these benchmarking studies with the 
cloud phase product is its direct and obvious correlation 
to the CIP icing potential. If the phase of the cloud 
diagnosed at any given pixel is liquid with temperature 
below 0 °C, one can assume that there is potential for 
icing because of the presence of supercooled liquid 
water (SLW). 

The verification study for the GDCP is done using 
PIREPs of positive and negative icing and is very similar 
to the CIP verification. Although the verification 
methods are essentially the same, some changes are 
inevitable because of the differences in the products. 
CIP is a three-dimensional product, able to be matched 
to any PIREP in its domain; the GDCP are two- 
dimensional and can only be considered valid near 
cloud top. Further, the GDCP also have a finer 
horizontal resolution than CIP. 

4.1 Methods

To ensure a fair verification using established
techniques, the products must first be put on the same
grid. The CIP is run on the 20-km Rapid Update Cycle
(RUC) model grid. The GDCP, derived from GOES
pixels with a nominal 4-5 km resolution, have been
remapped to a RUC projection, but with a 5-km grid. To
ensure that the higher resolution satellite products are
not penalized for their increased horizontal resolution,
the analysis has to be extended to cover the same
spatial domain as used in the CIP verification already
discussed. In the CIP study, each PIREP mandated an
examination of the four adjacent CIP grid points. To
cover the same area for evaluating the GDCP, each
PIREP requires an examination of 64 GDCP cloud
phase pixels. These spatially aggregated GDCP grid
points then became the basis of comparison to the
PIREPs.
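
The figure of 64 pixels follows from matching footprints, as the short calculation below illustrates. It assumes that each grid point stands for a square cell of the stated spacing and that the four adjacent CIP points form a 2 x 2 block; the function name is hypothetical.

```python
def pixels_for_same_footprint(coarse_km, coarse_points_per_side, fine_km):
    """Number of fine-grid pixels covering a square block of coarse cells."""
    side_km = coarse_km * coarse_points_per_side
    per_side, remainder = divmod(side_km, fine_km)
    if remainder:
        raise ValueError("fine grid does not tile the coarse footprint evenly")
    return per_side ** 2


# Four adjacent 20-km CIP points (2 x 2) span 40 km x 40 km, which is
# 8 x 8 pixels on the 5-km GDCP grid.
print(pixels_for_same_footprint(20, 2, 5))  # 64
```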

For the GDCP, the SLW condensate phase is
thresholded under the assumption that the presence of
SLW equates to a potential for icing. The thresholds are
expressed as the number of the 64 examined pixels that
are classified as SLW. For this evaluation eight
thresholds were used: 8, 16, 24, 32, 40, 48, 56, and
64 pixels.
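
A sketch of how a pixel-count threshold could turn the 64 cloud phase pixels around a PIREP into a YES/NO diagnosis is given below. The use of "greater than or equal to" is an assumption made by analogy with the CIP thresholding, and the phase label "SLW", the function name, and the sample block are hypothetical.

```python
THRESHOLDS = (8, 16, 24, 32, 40, 48, 56, 64)  # pixel counts used in the text


def gdcp_yes(phases, threshold):
    """Return True (YES) if enough of the 64 pixels are classified as SLW.

    Assumption: a count equal to the threshold is treated as YES.
    """
    if len(phases) != 64:
        raise ValueError("expected the 64 pixels surrounding the PIREP")
    slw_count = sum(1 for phase in phases if phase == "SLW")
    return slw_count >= threshold


# Hypothetical block: 20 SLW pixels and 44 ice pixels.
block = ["SLW"] * 20 + ["ice"] * 44
print([gdcp_yes(block, t) for t in THRESHOLDS])
```

Raising the threshold demands a more uniform SLW field around the report, so it trades detections of YES reports for correct NO diagnoses in the same way that raising the CIP threshold does.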


4.2 Data 

The valid time period for this evaluation is the same 
as for the CIP — 01 October 2003 to 31 March 2004 
with valid times of 1445 and 2045 UTC (the nearest 
match to 1500 and 2100 UTC). A total of 316 GDCP 
files were available for verification with each file 
representing a combination of GOES-10 and GOES-12 
data sets that have been stitched together at longitude 
99.5° W. 

Because the cloud phase observation is only valid
at cloud top, eight different methods were employed in
an attempt to discover the best way to determine
whether or not a PIREP was close enough to the
defined cloud top height. Each method was assigned a
flag value and is described in Table 5. This table
shows the counts for the numbers of PIREPs located
within certain distances of the GDCP Cloud Top Height
(CTH) measurements. These different groups of
observations are used to perform separate analyses on
GDCP.
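
The eight flag methods combine a height tolerance (1,000 or 3,000 ft), a reference height (median CTH or min/max CTH), and a pixel set (all cloudy pixels or SLW pixels only). A minimal sketch of the height test follows. The interpretation of "min/max" as the band from the minimum CTH minus the tolerance to the maximum CTH plus the tolerance is an assumption, as are the function and parameter names; the choice of pixel set is left to the caller, who passes in the CTH values of the relevant pixels.

```python
from statistics import median


def near_cloud_top(pirep_alt_ft, cth_ft, tol_ft, use_median):
    """Decide whether a PIREP altitude is close enough to cloud top.

    cth_ft: cloud top heights (ft) of the pixels chosen by the caller
            (all cloudy pixels, or SLW pixels only).
    """
    if not cth_ft:
        return False  # no usable cloud top near this report
    if use_median:
        return abs(pirep_alt_ft - median(cth_ft)) <= tol_ft
    # Assumed reading of "min/max": anywhere within the CTH range,
    # widened by the tolerance on both sides.
    return min(cth_ft) - tol_ft <= pirep_alt_ft <= max(cth_ft) + tol_ft


cths = [9000, 10000, 12000]  # hypothetical pixel cloud top heights (ft)
print(near_cloud_top(12500, cths, 1000, use_median=True),
      near_cloud_top(12500, cths, 1000, use_median=False))
```

Under this reading the min/max test is the more permissive of the two, which is consistent with the larger PIREP counts reported for Flags 5 through 8 in Table 5.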

4.3 Results 

The overall results are diagrammed in Figures 5
and 6. Figure 5 is an ROC diagram of CIP, AIRMET,
and GDCP for Flag 1. As shown in Figures 3 and 4,
CIP shows good skill with a large area under the ROC
curve. The single AIRMET data point is located just
below the CIP line. The GDCP (cloud phase) has
positive area under the curve and thus positive skill, but
has less area than the CIP. Figure 6 shows results for
an evaluation similar to that shown in Figure 5 but for
Flag 2. The results are similar in that CIP and the
AIRMETs show more skill than GDCP. In these two
figures the PODy(MOG) values show that GDCP has
some success in diagnosing the positive icing areas,
assuming that the presence of SLW shows a potential
for icing. The low PODn values are likely the reason for
the poor skill of GDCP using all eight thresholds.

Table 5: Counts of numbers of observations used
in the verification of GDCP.

FLAG  All PIREPs within ...                              YES MOG   NO
1     +/- 1,000 ft of median CTH (All Cloudy pixels)     134       1620
2     +/- 3,000 ft of median CTH (All Cloudy pixels)     349       3936
3     +/- 1,000 ft of median CTH (All SLW pixels)        130       1059
4     +/- 3,000 ft of median CTH (All SLW pixels)        328       2790
5     +/- 1,000 ft of min/max CTH (All Cloudy pixels)    643       6244
6     +/- 3,000 ft of min/max CTH (All Cloudy pixels)    828       8229
7     +/- 1,000 ft of min/max CTH (All SLW pixels)       393       3300
8     +/- 3,000 ft of min/max CTH (All SLW pixels)       552       4827

[Plot title: Oct 2003 - Mar 2004 (1 - PODn) v. PODy(mog) - 1]

Figure 5: Comparison of PODy(MOG) vs. 1-PODn for
CIP, AIRMETs, and GDCP using reports +/- 1,000 ft of
the GDCP median CTH value for all cloudy pixels.

Table 6 shows the verification results using specific
thresholds for both CIP (0.05 and 0.15) and GDCP for
all flag values (see Table 5). These are compared with
the results for the AIRMETs. The results show that the
PODy(MOG) values for GDCP ranged from 0.578 to
0.822 and were somewhat comparable to the CIP
(0.744 and 0.828) and the AIRMETs (0.751). The
PODn results are not as favorable as the PODy(MOG)
results, with scores ranging from 0.195 to 0.536 versus
CIP (0.604 and 0.671) and the AIRMETs (0.635). The
low PODn scores for GDCP are also responsible for
bringing down the True Skill Statistic (TSS) values, which
range from 0.002 to 0.206 compared to CIP (0.432 and
0.415) and the AIRMETs (0.386). Overall, the
evaluation using GDCP-1 and GDCP-2 (Table 6) shows
the best results for the GDCP with TSSs of 0.206 and
0.178, respectively.

[Plot title: Oct 2003 - Mar 2004 (1 - PODn) v. PODy(mog) - 2]

Figure 6: Comparison of PODy(MOG) vs. 1-PODn for
CIP, AIRMETs, and GDCP using reports +/- 3,000 ft of
the GDCP median CTH value for all cloudy pixels.

The poor results for the GDCP for PODn can be
attributed to the nature of the product and the
verification methods. Since the GDCP product is two-
dimensional and is only valid near cloud-top, a smaller
set of observations was used for verification. The area
that was verified contained SLW and an assumption of
positive icing potential was made in order to obtain a
verifiable field. The problem with this assumption is
that the presence of SLW does not necessarily imply
that the area has icing. In the case where SLW is
present but icing is not, a negative report of icing is
paired with a YES diagnosis and counted as an
incorrect diagnosis, which lowers PODn. Another
apparent problem with the verification method is the
lack of an exact CTH measurement. If a median
cloud-top height measurement is too high and there is a
negative report of icing in an area just above the real
cloud top containing SLW, then an incorrect diagnosis
is again recorded.

With the verification limited only to the areas around the
GDCP CTH measurement, the low values of PODn are
not surprising. After all, the GDCP only attempts to
detect areas of positive SLW and thus positive icing
potential. If clear areas were added to the verification (i.e.,
areas above the maximum CTH measurement), then the
PODn results would likely improve. However, addition
of the negative icing areas would most likely have a
negative effect on the PODy(MOG) statistics.


Table 6: Statistics for CIP, AIRMETs, and GDCP at
specific thresholds.

Product      PODy(MOG)   PODn    TSS
CIP (.05)    0.828       0.604   0.432
CIP (.15)    0.744       0.671   0.415
AIRMET       0.751       0.635   0.386
GDCP-1       0.679       0.527   0.206
GDCP-2       0.642       0.536   0.178
GDCP-3       0.708       0.294   0.002
GDCP-4       0.741       0.306   0.047
GDCP-5       0.579       0.481   0.060
GDCP-6       0.578       0.499   0.077
GDCP-7       0.822       0.195   0.017
GDCP-8       0.790       0.241   0.031
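
The TSS column of Table 6 can be cross-checked against the Table 3 definition, TSS = PODy + PODn - 1. The sketch below copies the three columns from Table 6; the variable names and the rounding tolerance of 0.0005 are choices made for this illustration.

```python
# (PODy(MOG), PODn, reported TSS) copied from Table 6.
TABLE6 = {
    "CIP (.05)": (0.828, 0.604, 0.432),
    "CIP (.15)": (0.744, 0.671, 0.415),
    "AIRMET": (0.751, 0.635, 0.386),
    "GDCP-1": (0.679, 0.527, 0.206),
    "GDCP-2": (0.642, 0.536, 0.178),
    "GDCP-3": (0.708, 0.294, 0.002),
    "GDCP-4": (0.741, 0.306, 0.047),
    "GDCP-5": (0.579, 0.481, 0.060),
    "GDCP-6": (0.578, 0.499, 0.077),
    "GDCP-7": (0.822, 0.195, 0.017),
    "GDCP-8": (0.790, 0.241, 0.031),
}


def tss(pod_y, pod_n):
    return pod_y + pod_n - 1.0


# Rows whose reported TSS differs from PODy + PODn - 1 beyond rounding.
mismatches = [name for name, (py, pn, reported) in TABLE6.items()
              if abs(tss(py, pn) - reported) > 5e-4]

# GDCP configuration with the highest reported TSS.
gdcp_rows = {n: v for n, v in TABLE6.items() if n.startswith("GDCP")}
best_gdcp = max(gdcp_rows, key=lambda n: gdcp_rows[n][2])

print(mismatches, best_gdcp)  # [] GDCP-1
```

Every row is consistent with the definition, and the check makes the pattern in the table explicit: the GDCP configurations with the highest PODy(MOG), Flags 7 and 8, have the lowest PODn and therefore almost no net discrimination.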


Another potential problem with this verification 
approach is the difficulty in evaluating a high-resolution 
observational product using a low-resolution verification 
data set (PIREPs in this case). While this can, in part, 
be compensated for by trying to combine many of the 
high-resolution data pixels into larger spatial blocks, this 
approach will necessarily reduce the PODn, as we have 
seen in this analysis. 

4.4 Plans for Further Analyses 

In addition to cloud phase, the GDCP include a 
number of additional products that may be of use in CIP 
and which will be verified as these studies progress. 
These products include icing risk, liquid water path, 
water drop radius, optical depth, cloud top pressure, and 
the cloud base and top heights. It is likely, however,
that a different sort of verification will be needed to
judge the effectiveness of some of these products,
possibly using PIREP-reported icing intensity and/or
research aircraft data.

In addition to the PIREP-based verification, we also 
plan to conduct a direct comparison of CIP and selected 
GDCP outputs including the total areas covered as well 
as overlapping and non-overlapping areas. In addition, 
we hope to derive statistics such as efficiency (POD 
divided by total area) to help us determine how the 
GDCP can best be incorporated into CIP. 

5. SUMMARY

Although encouraging, the methods and results 
presented here should be considered preliminary. CIP 
already does an excellent job at diagnosing the potential 
for icing on a 20-km scale, but we hope to do better. 
Our plans for this product include the depiction of icing 
severity, better information on locations of supercooled 
large droplets (SLD), and higher resolution products for 
the terminal area. We have only modest skill to date on 
severity and SLD and hope that the advanced satellite 
products will help increase our skill. For the terminal 
area, resolution of 5 km, or smaller, and time scales of 
15-30 min or shorter are needed, and these cannot be 
accomplished without satellite support. We already 
know, in a qualitative manner, the value of the advanced 
satellite products through our experience using them in 
field project forecasting exercises and it is now time to 
quantify their accuracy. 

We now have available a number of unique sets of
observations that should enable us to determine which 
advanced satellite products are the best for 
incorporation into CIP. We seek increased efficiency 
(greatest icing detection with smallest overwarning) and 
an improved icing severity algorithm. We also expect 
that the advanced satellite cloud products will enable 
more accurate icing diagnosis in areas where surface 
data are sparse, for example, over oceans and Alaska. 